Abstract

We attempt to set a mathematical foundation of immunology and
amino acid chains. To measure the similarities of these chains, a kernel
on strings is defined using only the sequence of the chains and a good
amino acid substitution matrix (e.g. BLOSUM62). The kernel is used
in learning machines to predict binding affinities of peptides to human
leukocyte antigen DR (HLA-DR) molecules. On both fixed allele [24]
and pan-allele [23] benchmark databases, our algorithm achieves
the state-of-the-art performance. The kernel is also used to define a
distance on an HLA-DR allele set based on which a clustering analysis
precisely recovers the serotype classifications assigned by WHO
[14] [22]. These results suggest that our kernel relates well the chain
structure of both peptides and HLA-DR molecules to their biological
functions, and that it offers a simple, powerful and promising
methodology to immunology and amino acid chain studies.



1 Introduction 



Large scientific and industrial enterprises are engaged in efforts to produce 
new vaccines from synthetic peptides. The study of peptide binding to ap- 
propriate alleles is a major part of this effort. Our goal here is to support 
the use of a certain "string kernel" for peptide binding prediction as well as 
for the classification of supertypes of the major histocompatibility complex 
(MHC) alleles. 

Our point of view is that some key biological information is contained in 
just two places: first, in a similarity kernel (or substitution matrix) on the 
set of the fundamental amino acids; and second, in a good representation of
the relevant alleles as strings of these amino acids. Our results bear this out. 

This is achieved with great simplicity and predictive power. Along the 
way we find that gaps and their penalties in the string kernels don't help, 
and that emphasizing peptide binding as a real-valued function rather than 
a binding/non-binding dichotomy clarifies the issues. We use a modification 
of BLOSUM62 followed by a Hadamard power. We also use regularized 
least squares (RLS) in contrast to support vector machines as the former is 
consistent with our regression emphasis. 

We next briefly describe the construction (more details are given in
Section 2) of our main kernel $\hat{K}^3$ on amino acid chains, inspired by
local alignment kernels (see e.g. [30]) as well as an analogous kernel in
vision (see [38]).

For the purposes of this paper, a kernel $K$ is a symmetric function
$K : X \times X \to \mathbb{R}$, where $X$ is a finite set. Given an order on
$X$, $K$ may be represented as a matrix (think of $X$ as the set of indices of
the matrix elements). Then it is assumed that $K$ is positive definite (in
such a representation).

Let $\mathscr{A}$ be the set of the 20 basic (for life) amino acids. Every
protein has a representation as a string of elements of $\mathscr{A}$.

Step 1. Definition of a kernel $K^1 : \mathscr{A} \times \mathscr{A} \to \mathbb{R}$.

BLOSUM62 is a similarity (or substitution) matrix on $\mathscr{A}$ frequently
used in immunology [13]. In the formulation of BLOSUM62, a kernel
$Q : \mathscr{A} \times \mathscr{A} \to \mathbb{R}$ is defined using blocks of
aligned strings of amino acids representing proteins. One can think of $Q$ as
the "raw data" of BLOSUM62. It is symmetric, positive-valued, and a
probability measure on $\mathscr{A} \times \mathscr{A}$. (We have in addition
checked that it is positive definite.)

Let $p$ be the marginal probability defined on $\mathscr{A}$ by $Q$. That is,

$$p(x) = \sum_{y \in \mathscr{A}} Q(x, y).$$

Next, we define the BLOSUM62-2 matrix, indexed by the set $\mathscr{A}$, as

$$[\text{BLOSUM62-2}](x, y) = \frac{Q(x, y)}{p(x)\, p(y)}.$$

We list the BLOSUM62-2 matrix in Appendix A. Suppose $\beta > 0$ is
a parameter, usually chosen to be a small positive fraction (still
mysterious). Then a kernel $K^1 : \mathscr{A} \times \mathscr{A} \to \mathbb{R}$
is given by

$$K^1(x, y) = \big([\text{BLOSUM62-2}](x, y)\big)^{\beta}. \qquad (1)$$

Note that the power in (1) is of the matrix entries, not of the matrix.
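Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter alphabet and the joint frequencies `Q` below are made up (they are not the BLOSUM62 data), and the function names `marginal` and `k1` are hypothetical. The sketch forms the ratio of joint frequency to the product of marginals and then raises each entry to the power beta, entry by entry.

```python
# Toy sketch of Step 1. The alphabet and the joint frequencies Q are
# invented for illustration; they are NOT the BLOSUM62 raw data.
ALPHABET = ["A", "B", "C"]
Q = {
    ("A", "A"): 0.30, ("A", "B"): 0.05, ("A", "C"): 0.05,
    ("B", "A"): 0.05, ("B", "B"): 0.20, ("B", "C"): 0.05,
    ("C", "A"): 0.05, ("C", "B"): 0.05, ("C", "C"): 0.20,
}


def marginal(Q, alphabet):
    """Marginal probability p(x) = sum over y of Q(x, y)."""
    return {x: sum(Q[(x, y)] for y in alphabet) for x in alphabet}


def k1(Q, alphabet, beta):
    """K^1(x, y) = (Q(x, y) / (p(x) p(y)))^beta.

    The power is applied to each matrix entry (a Hadamard power),
    not to the matrix as a whole.
    """
    p = marginal(Q, alphabet)
    return {
        (x, y): (Q[(x, y)] / (p[x] * p[y])) ** beta
        for x in alphabet
        for y in alphabet
    }


# beta = 0.1 is an arbitrary illustrative choice.
K1 = k1(Q, ALPHABET, beta=0.1)
```

With `beta=1.0` the same function returns the toy analogue of the BLOSUM62-2 matrix itself.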

Step 2. Let $\mathscr{A}^1 = \mathscr{A}$ and define
$\mathscr{A}^{k+1} = \mathscr{A}^k \times \mathscr{A}$ recursively for any
$k \in \mathbb{N}$. We say $s$ is an amino acid chain (or string) if
$s \in \bigcup_{k=1}^{\infty} \mathscr{A}^k$, and $s = (s_1, \ldots, s_k)$ is a
$k$-mer if $s \in \mathscr{A}^k$ for some $k \in \mathbb{N}$ with
$s_i \in \mathscr{A}$. Consider

$$K^2_k(u, v) = \prod_{i=1}^{k} K^1(u_i, v_i)$$

where $u$, $v$ are amino acid strings of the same length $k$,
$u = (u_1, \ldots, u_k)$, $v = (v_1, \ldots, v_k)$; $u$, $v$ are $k$-mers.
$K^2_k$ is a kernel on the set of all $k$-mers.
Step 3. Let $f = (f_1, \ldots, f_m)$ be an amino acid chain. Denote by $|f|$
the length of $f$ (so here $|f| = m$). Write $u \subset f$ whenever $u$ is of
the form $u = (f_{i+1}, \ldots, f_{i+k})$ for some
$1 \le i+1 \le i+k \le m$. Let $g$ be another amino acid chain, then define

$$K^3(f, g) = \sum_{\substack{u \subset f,\ v \subset g \\ |u| = |v| = k \\ \text{all } k = 1, 2, \ldots}} K^2_k(u, v)$$

for $f$ and $g$ in any finite set $X$ of amino acid chains. Here, and in all
of this paper, we abuse the notation to let the sum count each occurrence
of $u$ in $f$ (and of $v$ in $g$). In other words we count these occurrences
"with multiplicity". While $u$ and $v$ need to have the same length, this is
not so for $f$ and $g$. Replacing the sum by an average gives a different but
related kernel.

We define the correlation kernel $\hat{K}$ normalized from any kernel $K$ by

$$\hat{K}(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}}.$$

In particular, let $\hat{K}^3$ be the correlation kernel of $K^3$.
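The string kernel of Steps 2 and 3 and its correlation normalization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy letter kernel `toy_k1` are assumptions made for the example. Instead of enumerating all substring pairs, the sketch uses a small dynamic program: the contribution of all substring pairs starting at positions (i, j) equals K^1(f_i, g_j) times one plus the contribution of the pairs starting at (i+1, j+1).

```python
import math


def k3(f, g, k1):
    """Sum, over all pairs of equal-length contiguous substrings u of f and
    v of g (counted with multiplicity), of the product of k1 over letters."""
    m, n = len(f), len(g)
    # d[i][j] = sum over k >= 1 of prod_{t < k} k1(f[i+t], g[j+t]),
    # i.e. the contribution of all substring pairs starting at (i, j).
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    total = 0.0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            d[i][j] = k1(f[i], g[j]) * (1.0 + d[i + 1][j + 1])
            total += d[i][j]
    return total


def k3_hat(f, g, k1):
    """Correlation normalization: K(f, g) / sqrt(K(f, f) K(g, g))."""
    return k3(f, g, k1) / math.sqrt(k3(f, f, k1) * k3(g, g, k1))


def toy_k1(x, y):
    """Hypothetical letter kernel, used only for illustration."""
    return 1.0 if x == y else 0.5
```

For example, with this toy letter kernel `k3("AB", "AB", toy_k1)` adds four 1-mer pairs (1 + 0.5 + 0.5 + 1) and one 2-mer pair (1), giving 4. The dynamic program runs in time proportional to |f| times |g|.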

Remark 1. $\hat{K}^3$ is a kernel (see Section 2.2). It is symmetric,
positive definite, positive-valued; it is basic for the results and
development of this paper. We sometimes say string kernel. The construction
works for any kernel (in place of $K^1$) on any finite alphabet (replacing
$\mathscr{A}$).



Remark 2. For some background see e.g. [29] [31]. But we use no gap
penalty or even gaps, no logarithms, no implied round-offs, and no alignments
(except the BLOSUM62-2 matrix which indirectly contains some alignment
information). Our numerical experiments indicate that these don't help in
our context (at least!).

Remark 3. For complexity reasons one may limit the values of k in Step 3 
with a small loss of accuracy, or even choose the k-mers at random. 

Remark 4. The chains we use are proteins, peptides, and alleles. Peptides 
are short chain fragments of proteins. Alleles are realizations of genes in 
living organisms varying with the individual; as proteins they have represen- 
tations as amino acid chains. 

MHC II and MHC I are sets of alleles which are associated with
immunological responses to viruses, bacteria, peptides and the like. See
e.g. [20] for a good introduction. In this paper we only study HLA II, the
MHC II in human beings. HLA-DRB (or simply DRB) describes a subset of HLA II
alleles which play a central role in immunology, as well as in this paper.



1.1 First Application: Binding Affinity Prediction 

Peptide binding to a fixed HLA II (and HLA I as well) molecule (or an allele)
$a$ is a crucial step in the immune response of the human body to a pathogen
or a peptide-based vaccine. Its prediction is computed from data of the form
$(x_i, y_i)_{i=1}^{m}$, $x_i \in \mathscr{P}_a$ and $y_i \in [0, 1]$, where
$\mathscr{P}_a$ is a set of peptides (i.e. chains of amino acids; in this
paper we study peptides of length 9 to 37 amino acids, usually about 15)
associated to an HLA II allele $a$. Here $y_i$ expresses the strength of the
binding of $x_i$ to $a$. The peptide binding problem occupies much research.
We may use our kernel $\hat{K}^3$ described above for this problem since
peptides are represented as strings of amino acids. Our prediction thus
uses only the amino acid chains of the peptides, a substitution matrix, and
some existing binding affinities (as "data").

Following RLS supervised learning with kernel $K = \hat{K}^3$, the main
construction is to compute

$$f_a = \arg\min_{f \in \mathscr{H}_K} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda \|f\|_K^2. \qquad (2)$$

Here $\lambda > 0$ and the index $\beta > 0$ in $\hat{K}^3$ are chosen by a
procedure called leave-one-out cross validation. Also $\mathscr{H}_K$ is the
space of functions spanned by $\{K_x : x \in \mathscr{P}\}$ (where
$K_x(y) := K(x, y)$) on a finite set $\mathscr{P}$ of peptides containing
$\mathscr{P}_a$. An inner product on $\mathscr{H}_K$ is defined on the basis
vectors as $\langle K_x, K_y \rangle_{\mathscr{H}_K} = K(x, y)$, then in
general by linear extension. The norm of $f \in \mathscr{H}_K$ induced by this
inner product is denoted by $\|f\|_K$. In (2), $f_a$ is the predicted peptide
binding function. We refer to this algorithm as "KernelRLS".

For the set of HLA II alleles, with the best data available we have Table
1. The area under the receiver operating characteristic curve (area under
the ROC curve, AUC) is the main measure of accuracy used in the peptide
binding literature. NN-W refers to the algorithm which up to now has
achieved the most accurate results for this problem, although there are many
previous contributions such as [8]. Section 2 gives more detail.
Allele a            #P_a    KernelRLS   KernelRLS   NN-W in [24]
                            RMSE        AUC         AUC

DRB1*0101           5166    0.18660     0.85707     0.836
DRB1*0301           1020    0.18497     0.82813     0.816
DRB1*0401           1024    0.24055     0.78431     0.771
DRB1*0404            663    0.20702     0.81425     0.818
DRB1*0405            630    0.20069     0.79296     0.781
DRB1*0701            853    0.21944     0.83440     0.841
DRB1*0802            420    0.19666     0.83538     0.832
DRB1*0901            530    0.25398     0.66591     0.616
DRB1*1101            950    0.20776     0.83703     0.823
DRB1*1302            498    0.22569     0.80410     0.831
DRB1*1501            934    0.23268     0.76436     0.758
DRB3*0101            549    0.15945     0.80228     0.844
DRB4*0101            446    0.20809     0.81057     0.811
DRB5*0101            924    0.23038     0.80568     0.797

Average                     0.21100     0.80260     0.798
Weighted Average            0.20451     0.82059     0.810



Table 1: The algorithm performance of RLS on each fixed allele in the
benchmark [23]. If $a$ is the allele in column 1, then the number of peptides
in $\mathscr{P}_a$ is given in column 2. The root-mean-square error (RMSE)
scores are listed (see Section 2). The AUC scores of the RLS and the NN-W
algorithm are listed for comparison, where a common threshold
$\theta = 0.4256$ is used [24] in the final thresholding step into binding
and non-binding (see Section 2.3 for the details). The best AUC in each row
is marked in bold. In all the tables the weighted average scores are given
by the weighting on the size $\#\mathscr{P}_a$ of the corresponding peptide
sets $\mathscr{P}_a$.

We note the simplicity and universality of the algorithm that is based
on $\hat{K}^3$, which itself has this simplicity with the contributions from
the substitution matrix (i.e. BLOSUM62-2) and the sequential representation
of the peptides. There is an important generalization of the peptide binding
problem where the allele is allowed to vary. Our results on this problem are
detailed in Section 3.

1.2 Second Application: Clustering and Supertypes 

We consider the classification problem of DRB (HLA-DR $\beta$ chain) alleles
into groups called supertypes as follows. The understanding of DRB
similarities is very important for the design of high population coverage
vaccines. An HLA gene can generate a large number of allelic variants and
this polymorphism protects a population from being eradicated by a single
pathogen. Furthermore, there are no more than twelve HLA II alleles in
each individual [16] and each HLA II allele binds only to specific peptides
[33] [43]. As a result, it is difficult to design an effective vaccine for a
large population. It has been demonstrated that many HLA molecules have
overlapping peptide binding sets and there have been several attempts to
group them into supertypes accordingly (see e.g. [36] [26], among others).
The supertypes are designed so that the HLA molecules in the same supertype
will have a similar peptide binding specificity.

The Nomenclature Committee of the World Health Organization (WHO)
[22] has given extensive tables on serological type assignments to DRB
alleles which are based on the work of many organizations and labs
throughout the world. In particular the HLA dictionary 2008 by Holdsworth
et al. [14] acknowledges especially the data from the WHO Nomenclature
Committee for Factors of the HLA system, the International Cell Exchange and
the National Marrow Donor Program. The text in Holdsworth et al., 2008 [14]
also indicates the ambiguities of such assignments, especially in certain
serological types.

We define a set $\mathscr{N}$ of DRB alleles as follows. We downloaded 820
DRB allele sequences from the IMGT/HLA Sequence Database [27].¹ Then
14 non-expressed alleles were excluded, leaving 806 alleles. We
use two markers "RFL" and "TVQ", each of which consists of three amino
acids, to identify the polymorphic part of a DRB allele. For each allele,
we only consider the amino acids located between the markers "RFL" (the
location of the first occurrence of "RFL") and "TVQ" (the location of the
last occurrence of "TVQ"). One reason is that the majority of polymorphic
positions occur in exon 2 of the HLA class II genes [11], and the amino acids
located between the markers "RFL" and "TVQ" constitute the whole exon 2 [40].
The DRB alleles are encoded by 6 exons. Exon 2 is the most important
component constituting an HLA II-peptide binding site. The other reason is
that in the HLA pseudo-sequences used in NetMHCIIpan [23], all positions of
the allele in contact with the peptide occur in this range.

Thus each allele is transformed into a normal form. We should note that
two different alleles may have the same normal form. For those alleles with
the same normal form, we only consider the first one. The order is according
to the official names given by WHO. We collect the remaining 786 alleles
with no duplicate normal forms into a set, which we call $\mathscr{N}$. This
set not only includes all alleles listed in the tables of [14], but also
contains all new alleles from 2008 until August 2011.
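The normal-form step described above can be sketched in Python. This is a hedged illustration, not the authors' code: the function names are hypothetical, the example sequences in the usage note are invented, and two details are assumptions that the text does not settle, namely that the markers themselves are kept in the normal form and that sequences lacking either marker are skipped.

```python
def normal_form(seq, start_marker="RFL", end_marker="TVQ"):
    """Return the part of seq from the first start_marker to the last
    end_marker. Keeping the markers in the result is an assumption.
    Returns None if the markers are missing or out of order."""
    i = seq.find(start_marker)
    j = seq.rfind(end_marker)
    if i < 0 or j < 0 or j < i + len(start_marker):
        return None
    return seq[i:j + len(end_marker)]


def dedupe_by_normal_form(named_seqs):
    """named_seqs: iterable of (name, sequence), already sorted by name.

    Keeps only the first allele for each distinct normal form and
    returns a dict mapping allele name to normal form.
    Skipping sequences without markers is an assumption."""
    first_name = {}
    for name, seq in named_seqs:
        nf = normal_form(seq)
        if nf is not None and nf not in first_name:
            first_name[nf] = name
    return {name: nf for nf, name in first_name.items()}
```

For instance, `normal_form("XXRFLABCTVQYY")` gives `"RFLABCTVQ"`, and two input sequences that share this normal form contribute a single entry, attributed to whichever name comes first in the given order.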

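Thus $\mathscr{N}$ may be identified with a set of amino acid sequences. Next
we impose the kernel $\hat{K}^3$ above on $\mathscr{N}$ with $\beta = 0.06$,
and call the resulting kernel $\hat{K}^3_{\mathscr{N}}$.

On $\mathscr{N}$ we define a distance derived from $\hat{K}^3_{\mathscr{N}}$ by

$$D_{L^2}(x, y) = \left( \frac{1}{\#\mathscr{N}} \sum_{z \in \mathscr{N}} \big( \hat{K}^3_{\mathscr{N}}(x, z) - \hat{K}^3_{\mathscr{N}}(y, z) \big)^2 \right)^{1/2}. \qquad (3)$$

Here and in the sequel we denote by $\#A$ the size of a finite set $A$.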

The DRB1*11 and DRB1*13 families of alleles have been the most difficult
to deal with by WHO and for us as well. Therefore we exclude
the DRB1*11 and DRB1*13 families of alleles in the following cluster tree
construction, given the evidence that clustering of these 2 groups is
ineffective. They are left to be analyzed separately.²

The set $\mathscr{M}$ consists of all DRB alleles except for the DRB1*11 and
DRB1*13 families of alleles. $\mathscr{M}$ is a subset of the set
$\mathscr{N}$. We produce a clustering of $\mathscr{M}$ based on the $L^2$
distance $D_{L^2}$ restricted to $\mathscr{M}$ and use the OWA (Ordered
Weighted Averaging) [12] based linkage instead of the "single" linkage in
the hierarchical clustering algorithm.

1 ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/DRB_prot.fasta

2 We have found from a number of different experiments that "they do not
cluster". Perhaps the geometric phenomenon here is in the higher dimensional
scaled topology, i.e. the Betti numbers $b_i > 0$, for $i > 0$.

This clustering uses no previous serological type information and no
alignments. We have assigned supertypes labeled ST1, ST2, ST3, ST4, ST5, ST6,
ST7, ST8, ST9, ST10, ST51, ST52 and ST53 to certain clusters in the tree
shown in Figure 1 based on the contents of the clusters described in Table 6.
Peptides have played no role in our model. In contrast to the artificial
neural network method (see e.g. [11]), no "training data" of any previously
classified alleles are used in our clustering. We make use of the DRB amino
acid sequences to build the cluster tree. Making use of these amino acid
sequences only, our supertypes are in exact agreement with WHO assigned
serological types [14], as can be seen by checking the supertypes against
the clusters in Table 6.
[Figure 1: cluster tree. Leaf clusters as labeled in the figure: ST52, ST3,
ST6, ST8, ST4, ST2, ST5, ST53, ST9, ST7, ST51, ST10, ST1.]

Figure 1: Cluster tree on 559 DRB alleles. The diameters of the leaf nodes
are given at the bottom of the figure. The numbers given in the figure are
the diameters of the corresponding unions of clusters.



This second application is given in some detail in Section 4.



2 Kernel Method for Binding Affinity Prediction

In this section we describe in detail the construction of our string kernel. 
The motivation is to relate the sequence information of strings (peptides 
or alleles) to their biological functions (binding affinities). A kernel works 
as a measure of similarity and supports the application of powerful machine 
learning algorithms such as RLS which we use in this paper. For a fixed allele, 
binding affinity is a function on peptides with values in [0,1]. The function 
values on some peptides are available as the data, according to which RLS
outputs a function that predicts for a new peptide the binding affinity to the 
allele. The method is generalized in the next section to the pan-allele kernel 
algorithm that takes also the allele structure into account. 



2.1 Kernels 

We suppose throughout the paper that X is a finite set. We now give the 
definition of a kernel, of which an important example is our string kernel. 

Definition 1. A symmetric function $K : X \times X \to \mathbb{R}$ is called
a kernel on $X$ if it is positive definite, in the sense that by choosing an
order on $X$, $K$ can be represented as a positive definite matrix
$(K(x, y))_{x, y \in X}$.

Kernels have the following properties (see e.g. [5] [33]).

Lemma 1. (i) If $K$ is a kernel on $X$ then it is also a kernel on any subset
$X_1$ of $X$.

(ii) If $K_1$ and $K_2$ are kernels on $X$, then
$K : X \times X \to \mathbb{R}$ defined by

$$K(x, x') = K_1(x, x') + K_2(x, x')$$

is also a kernel.

(iii) If $K_1$ is a kernel on $X_1$ and $K_2$ is a kernel on $X_2$, then
$K : (X_1 \times X_2) \times (X_1 \times X_2) \to \mathbb{R}$ defined by

$$K((x_1, x_2), (x_1', x_2')) = K_1(x_1, x_1') \cdot K_2(x_2, x_2')$$

is a kernel on $X_1 \times X_2$.

(iv) If $K$ is a kernel on $X$, and $f$ is a real-valued function on $X$ that
maps no point to zero, then $K' : X \times X \to \mathbb{R}$ defined by

$$K'(x, x') = f(x) K(x, x') f(x')$$

is also a kernel.

(v) If $K(x, x) > 0$ for all $x \in X$, then the correlation normalization
$\hat{K}$ of $K$ given by

$$\hat{K}(x, x') = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}} \qquad (4)$$

is also a kernel.

Proof. (i), (ii) and (iv) follow from the definition directly. (iii) follows
from the fact that the Kronecker product of two positive definite matrices is
positive definite; see [15] for details. The positive definiteness of a
kernel $K$ guarantees that $K(x, x) > 0$ for any $x$ in $X$, so (v) follows
from (iv). □

Remark 5. Notice that with correlation normalization we have
$\hat{K}(x, x) = 1$ for all $x \in X$. This is a desired property because the
kernel function is usually used as a similarity measure, and with $\hat{K}$
we can say that each $x \in X$ is similar to itself.

Define the real-valued function $K_x$ on $X$ by $K_x(y) = K(x, y)$. The
function space $\mathscr{H}_K = \mathrm{span}\{K_x : x \in X\}$ is a Euclidean
space with inner product $\langle K_x, K_y \rangle = K(x, y)$, extended
linearly to $\mathscr{H}_K$. The norm of a function $f$ in $\mathscr{H}_K$ is
denoted as $\|f\|_K$.

Remark 6. The kernel can be defined even without assuming $X$ is finite; in
this general case the kernel is referred to as a reproducing kernel. If $X$
is finite then a reproducing kernel is equivalent to our "kernel". The theory
of reproducing kernel Hilbert spaces plays an important role in learning.

On a finite set $X$ there are two notions of distance derived from a kernel
$K$. The first one is the usual distance in $\mathscr{H}_K$, that is

$$D_K(x, x') = \|K_x - K_{x'}\|_K,$$

for two points $x, x' \in X$. The second one is the $L^2$ distance defined by

$$D_{L^2}(x, x') = \left( \frac{1}{\#X} \sum_{t \in X} \big( K(x, t) - K(x', t) \big)^2 \right)^{1/2}.$$

Important examples of the kernels discussed above are our kernel $K^3$ and
its normalization $\hat{K}^3$, both defined on any finite
$X \subset \bigcup_{k \ge 1} \mathscr{A}^k$.
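Both distances are easy to compute once the kernel is stored as a matrix over an ordered finite set. The sketch below is illustrative only: it assumes the kernel is given as a list of lists `K` indexed by integer positions, and the function names are hypothetical. The first distance uses the expansion of the squared norm of K_x minus K_x' via the inner product, which gives K(x,x) - 2K(x,x') + K(x',x').

```python
import math


def d_k(K, x, xp):
    """Distance in the function space: sqrt(K(x,x) - 2K(x,x') + K(x',x'))."""
    sq = K[x][x] - 2.0 * K[x][xp] + K[xp][xp]
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))


def d_l2(K, x, xp):
    """L^2 distance: root of the mean squared difference of rows x and x'."""
    n = len(K)
    return math.sqrt(sum((K[x][t] - K[xp][t]) ** 2 for t in range(n)) / n)
```

For the 2 by 2 toy matrix `[[1.0, 0.5], [0.5, 1.0]]` the first distance between the two points is 1 and the second is 0.5.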




2.2 Kernel on Strings 

We start with a finite set $\mathscr{A}$ called the alphabet. In the work
here $\mathscr{A}$ is the set of 20 amino acids, but the theory in this
section applies to any other finite set. For example, as the name suggests,
it can work on text for semantic analysis with a similar setting. See also
[38] for the framework in vision.

To measure a similarity among the 20 amino acids, Henikoff and Henikoff
[13] collect families of related proteins, align them and find conserved
regions (i.e. regions that do not mutate frequently or greatly) as blocks in
the families. The occurrence of each pair of amino acids in each column of
every block is counted. A large number of occurrences indicates that in the
conserved regions the corresponding pair of amino acids substitute each
other frequently, or in other words, that they are similar. A symmetric
matrix $Q$ indexed by $\mathscr{A} \times \mathscr{A}$ is eventually obtained
by normalizing the occurrences, so that
$\sum_{x, y \in \mathscr{A}} Q(x, y) = 1$ and $Q(x, y)$ indicates the
frequency of occurrences. See [13] for details. The BLOSUM62 matrix is
constructed accordingly.

Define $K^1 : \mathscr{A} \times \mathscr{A} \to \mathbb{R}$ as

$$K^1(x, y) = \left( \frac{Q(x, y)}{p(x)\, p(y)} \right)^{\beta}, \quad \text{for some } \beta > 0,$$

where $p : \mathscr{A} \to [0, 1]$ given by

$$p(x) = \sum_{y \in \mathscr{A}} Q(x, y),$$

is the marginal probability distribution on $\mathscr{A}$. When $\beta = 1$,
we name the matrix $(K^1(x, y))_{x, y \in \mathscr{A}}$ as BLOSUM62-2 (one
takes logarithm with base 2, scales it with factor 2, and rounds the obtained
matrix to integers to obtain the BLOSUM62 matrix). Notice that if one chooses
simply $Q = \frac{1}{m} I_{m \times m}$, then one obtains the matrix
$I_{m \times m}$ as the analogue of the BLOSUM62-2, and the corresponding
$K^3$ of the introduction is called the spectrum kernel [17].

In matrix language $K^1$ is the Hadamard power of the BLOSUM62-2 matrix,
where for a matrix $M = (M_{i,j})$ with positive entries and a number
$\beta > 0$, we denote $M^{\circ \beta}$ as the $\beta$-th Hadamard power of
$M$ and $\log^{\circ} M$ as the Hadamard logarithm of $M$, and their entries
are, respectively,

$$(M^{\circ \beta})_{i,j} := (M_{i,j})^{\beta}, \qquad (\log^{\circ} M)_{i,j} := \log(M_{i,j}).$$

Theorem 1 (Horn and Johnson [15]). Let $A$ be an $m \times m$ positive-valued
symmetric matrix. The Hadamard power $A^{\circ \beta}$ is positive definite
for any $\beta > 0$ if and only if the Hadamard logarithm $\log^{\circ} A$ is
conditionally positive definite (i.e. positive definite on the space
$V = \{v = (v_1, \ldots, v_m) \in \mathbb{R}^m : \sum_{i=1}^{m} v_i = 0\}$).

Proposition 1. Every positive Hadamard power of BLOSUM62-2 is positive
definite. Thus the above defined $K^1$ is a kernel for every $\beta > 0$.

Proof. One just shows the eigenvalues of the Hadamard logarithm on $V$ are
all positive. One checks this by computer. □

Theorem 2. Based on any kernel $K^1$, the functions $K^2_k$, $K^3$, and
$\hat{K}^3$ defined as in the introduction are all kernels.

Proof. The fact that $K^2_k$ is a kernel for $k \ge 1$ follows from Lemma 1
(iii). We now prove that $K^3$ is positive definite on any finite set $X$ of
strings, which then implies the same for $\hat{K}^3$ by Lemma 1 (v). From
Lemma 1 (i) it suffices to verify the cases that
$X = X_k = \bigcup_{i=1}^{k} \mathscr{A}^i$ for $k \ge 1$. When $k = 1$,
$K^3$ is just $K^1$ and hence positive definite. We assume now that $K^3$ is
positive definite on $X_k$ with $k = n$.

We claim that the matrices indexed by $X_{n+1}$,

$$K^3_{i, X_{n+1}}(f, g) = \begin{cases} \sum_{\substack{u \subset f,\ v \subset g \\ |u| = |v| = i}} K^2_i(u, v) & \text{if } |f| \ge i \text{ and } |g| \ge i, \\ 0 & \text{if } |f| < i \text{ or } |g| < i, \end{cases}$$

are all positive semi-definite. In fact, for any $1 \le i \le n$,

$$K^3_{i, X_{n+1}} = P_i K^2_i P_i^{T}, \qquad (5)$$

where $K^2_i$ is the matrix $(K^2_i(u, v))_{u, v \in \mathscr{A}^i}$, and
$P_i$ is a matrix with $X_{n+1}$ as the row index set and $\mathscr{A}^i$ as
the column index set, and for any $f \in X_{n+1}$ and $u \in \mathscr{A}^i$,
$P_i(f, u)$ counts the number of times $u$ occurs in $f$. Let us explain
equation (5) a little more. For $f$ and $g$ in $X_{n+1}$, from the definition
of $P_i$ we have

$$(P_i K^2_i P_i^{T})(f, g) = \sum_{u, v \in \mathscr{A}^i} P_i(f, u)\, P_i(g, v)\, K^2_i(u, v) = \sum_{\substack{u \subset f,\ v \subset g \\ |u| = |v| = i}} K^2_i(u, v). \qquad (6)$$

Summing the equation (6) above over $i \in \mathbb{N}$ gives the definition
of $K^3(f, g)$. For $i = n + 1$, we have

$$K^3_{n+1, X_{n+1}}(f, g) = \begin{cases} 0 & \text{if } f \notin \mathscr{A}^{n+1} \text{ or } g \notin \mathscr{A}^{n+1}, \\ K^2_{n+1}(f, g) & \text{otherwise.} \end{cases}$$

Therefore $K^3_{n+1, X_{n+1}}$ is positive definite on $\mathscr{A}^{n+1}$,
and is zero elsewhere. Since

$$K^3(f, g) = \sum_{i=1}^{n} K^3_{i, X_{n+1}}(f, g), \quad \forall f, g \in X_n,$$

we know that the sum of $K^3_{i, X_{n+1}}$ with $i = 1, \ldots, n$ is positive
definite on $X_n$, and positive semi-definite on $X_{n+1}$. Because

$$K^3(f, g) = \sum_{i=1}^{n+1} K^3_{i, X_{n+1}}(f, g), \quad \forall f, g \in X_{n+1},$$

we see that $K^3$ is positive definite on $X_{n+1}$. □

Corollary 1. Our kernels $K^2_k$, $K^3$ and $\hat{K}^3$ are discriminative.
That is, given any two strings $f$, $g$ in the domain of $K$, as long as
$f \ne g$, we have $D_K(f, g) > 0$. Here $K$ stands for any of the three
kernels.



2.3 First Application: Peptide Affinities Prediction 

We first briefly review the RLS algorithm inspired by learning theory. Let
$K$ be a kernel on a finite set $X$. Write $\mathscr{H}_K$ to denote the
inner product space of functions on $X$ defined by $K$. Suppose
$z = \{(x_i, y_i)\}_{i=1}^{m}$ is a sample set (called the training set) with
$x_i \in X$ and $y_i \in \mathbb{R}$ for each $i$. The RLS uses a positive
parameter $\lambda > 0$ and $z$ to generate the output function
$f_{z,\lambda} : X \to \mathbb{R}$, defined as

$$f_{z,\lambda} = \arg\min_{f \in \mathscr{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \right\}. \qquad (7)$$

Since $\mathscr{H}_K$ is of finite dimension, one solves (7) by representing
$f$ linearly by functions $K_x$ with $x \in X$ and finding the coefficients.
See [5] [32] for details.
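The finite-dimensional solve mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the empirical term is averaged over the m samples, so that writing f as a combination of the kernel functions at the training points leads to the linear system (K + lambda m I) c = y; if the empirical term is a plain sum instead, lambda m becomes lambda. The function names are hypothetical, and a small Gaussian elimination is included only to keep the example free of external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def rls_fit(K_train, y, lam):
    """Coefficients c of f = sum_j c_j K_{x_j}, from (K + lam * m * I) c = y.

    K_train[i][j] = K(x_i, x_j) on the training points (assumed layout)."""
    m = len(y)
    A = [[K_train[i][j] + (lam * m if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    return solve(A, y)


def rls_predict(k_row, coeffs):
    """Predicted value at x, where k_row[j] = K(x, x_j)."""
    return sum(c * k for c, k in zip(coeffs, k_row))
```

Because the predictor only needs kernel values between the new point and the training points, the same coefficients apply whatever larger finite set the kernel is defined on.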

Remark 7. The RLS algorithm (7) is independent of the choice of the
underlying space $X$ where the function space $\mathscr{H}_K$ is defined, in
the sense that the predicted values $f_{z,\lambda}(x)$ at $x \in X$ will not
be changed if we extend $K$ onto a larger set $X' \supset X$ and re-run (7)
with the same $z$ and $\lambda$. This is guaranteed by the construction of
the solution. See e.g. [5] [32].

Five-fold cross validation is employed to evaluate the performance of the
algorithms. Suppose $z$ is partitioned into five divisions (we assume
$m \ge 5$, which is always the case in this paper). Five-fold cross
validation is the procedure that validates an algorithm (with fixed
parameters) as follows. We choose one of the five divisions of the data for
testing, train the algorithm on the remaining four divisions, and predict
the output function on the testing division. We repeat this five times so
that each division is used exactly once as the testing data, and thus every
sample $x_i$ is labeled with both the observed value $y_i$ and the predicted
value $\hat{y}_i$. The algorithm performance is obtained by comparing the two
values over the whole sample set. Similarly one defines $n$-fold cross
validation for any $n \le m$. As an important special instance, the $m$-fold
case is also referred to as leave-one-out cross validation. Cross validation
is also used to tune parameters.
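The cross validation procedure can be sketched generically. The code below is an illustration only: the function names are hypothetical, `fit` and `predict` stand for any learning algorithm, and assigning sample i to fold i mod n is an arbitrary choice made for the example (the experiments in this paper reuse a fixed partition supplied with the benchmark).

```python
def k_fold_splits(m, n_folds):
    """Yield (train_indices, test_indices); sample i goes to fold i % n_folds.

    The modulo assignment is an illustrative choice, not the paper's."""
    for t in range(n_folds):
        test = [i for i in range(m) if i % n_folds == t]
        train = [i for i in range(m) if i % n_folds != t]
        yield train, test


def cross_validated_predictions(xs, ys, n_folds, fit, predict):
    """Return one out-of-fold prediction per sample.

    fit(train_xs, train_ys) returns a model; predict(model, x) a value."""
    y_hat = [None] * len(xs)
    for train, test in k_fold_splits(len(xs), n_folds):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            y_hat[i] = predict(model, xs[i])
    return y_hat
```

Setting `n_folds` equal to the number of samples gives leave-one-out cross validation.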

Binding affinity measures the strength with which a peptide binds to an
allele, and is represented by the IC50 score. Usually an IC50 score lies
between 0 and 50,000 (nanomolar). A widely used IC50 threshold determining
binding and non-binding is 500 ("binding" if the IC50 value is less than
500). The bioinformatics community usually normalizes the scores by the
function $\psi_b : (0, +\infty) \to [0, 1]$ with a base $b > 1$,

$$\psi_b(x) := \begin{cases} 0 & x > b, \\ 1 - \log_b x & 1 \le x \le b, \\ 1 & x < 1. \end{cases} \qquad (8)$$

Without introducing any ambiguity we will in the sequel refer to the
normalized IC50 value as the binding affinity using an appropriate value of
$b$.
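The normalization of IC50 scores is a one-line formula; a small sketch makes the clipping at both ends explicit. The function name `psi` is hypothetical, and the default base of 50,000 is taken from the value the text uses for its benchmark data.

```python
import math


def psi(x, b=50000.0):
    """Map an IC50 value x > 0 to [0, 1]: 0 above b, 1 below 1,
    and 1 - log_b(x) in between."""
    if x <= 0:
        raise ValueError("IC50 value must be positive")
    if x > b:
        return 0.0
    if x < 1.0:
        return 1.0
    return 1.0 - math.log(x) / math.log(b)
```

With this base, the common binding threshold of 500 nM maps to roughly 0.4256, so stronger binders (smaller IC50) receive normalized affinities closer to 1.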

We test the kernel with RLS on the IEDB benchmark data set published
in [25]. The data set covers 14 DRB alleles, each allele $a$ with a set
$\mathscr{P}_a$ of peptides. For any $p \in \mathscr{P}_a$, its sequence
representation and the $[0, 1]$-valued binding affinity $y_{a,p}$ to the
allele $a$ are both given. On this data set we compare our algorithm with
the state-of-the-art NN-align algorithm proposed in [21]. In [21], for each
allele $a$, the peptide set $\mathscr{P}_a$ was divided into 5 parts for
validating the performance.³

Now fix an allele $a$. Set $X = \mathscr{P} \supset \mathscr{P}_a$ (Remark 7
shows that one may select any finite $\mathscr{P}$ that contains
$\mathscr{P}_a$ here). Define the kernel $\hat{K}^3$ on $X$ through the steps
in the Introduction (leaving the power index $\beta$ to be fixed). We use the
same 5-fold partition $\mathscr{P}_a = \bigcup_{t=1}^{5} \mathscr{P}_{a,t}$
as in [25], and use five-fold cross validation to test our algorithm (7) with
$K = \hat{K}^3$. In the $t$-th test ($t = 1, \ldots, 5$) four parts of
$\mathscr{P}_a$ are merged to be the training data, denoted as
$\mathscr{P}_a^{(t)} = \mathscr{P}_a \setminus \mathscr{P}_{a,t}$, and
$\mathscr{P}_{a,t}$ is left as the testing data. For fixed $t$ and $a$, we
further tune the parameter $\beta$ in $\hat{K}^3$ and the regularization
parameter $\lambda$ in (7) by leave-one-out cross validation with
$z = \mathscr{P}_a^{(t)}$. Every pair of $\beta$ in the geometric sequence
$\{0.001, \ldots, 10\}$ of length 30 and $\lambda$ in the geometric sequence
$\{e^{-17}, \ldots, e^{-3}\}$ of length 15 is tested. With the optimal pair
$(\beta_a^{(t)}, \lambda_a^{(t)})$, we train the RLS (7) once more on
$\mathscr{P}_a^{(t)}$ to give the predicted binding function
$f_{\beta_a^{(t)}, \lambda_a^{(t)}}$ on $\mathscr{P}$. After the five times of
testing on allele $a$, we denote
$\hat{y}_{a,p} = f_{\beta_a^{(t)}, \lambda_a^{(t)}}(p)$ for each
$p \in \mathscr{P}_{a,t}$ and $t = 1, \ldots, 5$.

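The RMSE score is therefore evaluated as

$$\mathrm{RMSE}_a = \sqrt{ \frac{1}{\#\mathscr{P}_a} \sum_{p \in \mathscr{P}_a} (\hat{y}_{a,p} - y_{a,p})^2 }.$$

A smaller RMSE score indicates a better algorithm performance. Since the
affinity labels in this data set are transformed with $\psi_{b=50{,}000}$,
there is a threshold $\theta = \psi_{50{,}000}(500) \approx 0.4256$ in [21]
dividing the peptides $p \in \mathscr{P}_a$ into "binding" to the allele $a$
if $y_{a,p} > \theta$ and "non-binding" otherwise. Denote
$\mathscr{P}_{a,B} = \{p \in \mathscr{P}_a : y_{a,p} > \theta\}$ and
$\mathscr{P}_{a,N} = \mathscr{P}_a \setminus \mathscr{P}_{a,B}$. Then the AUC
index is defined to be

$$\mathrm{AUC}_a = \frac{\#\{(p, p') : p \in \mathscr{P}_{a,B},\ p' \in \mathscr{P}_{a,N},\ \hat{y}_{a,p} > \hat{y}_{a,p'}\}}{(\#\mathscr{P}_{a,B})(\#\mathscr{P}_{a,N})} \in [0, 1]. \qquad (9)$$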
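The two scores can be computed directly from the observed and predicted affinities. The sketch below is illustrative: the function names are hypothetical, and the AUC is computed as the fraction of (binder, non-binder) pairs whose predicted values are ordered correctly, with ties counted as failures because the comparison is strict.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted affinities."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def auc(y_true, y_pred, theta):
    """Fraction of (binder, non-binder) pairs ranked correctly.

    A peptide is a binder when its observed affinity exceeds theta.
    Assumes both classes are non-empty."""
    binders = [p for t, p in zip(y_true, y_pred) if t > theta]
    non_binders = [p for t, p in zip(y_true, y_pred) if t <= theta]
    wins = sum(1 for pb in binders for pn in non_binders if pb > pn)
    return wins / (len(binders) * len(non_binders))
```

The double loop is quadratic in the number of peptides, which is acceptable for a few thousand peptides per allele; a rank-based formula would be faster for larger sets.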

The sequence of ideas for each allele $a$ leads to Table 1. The computation
also suggests a weighted optimal value of $\beta$,

$$\beta^{*}_{\mathrm{peptide}} = 0.11387. \qquad (10)$$

We will use this value in the next section.



3 Both the data set and the 5-fold partition are available at
http://www.cbs.dtu.dk/suppl/immunology/NetMHCII-2.0.php.
Remark 8. We take the point of view that peptide binding is a matter of
degree and hence is better measured by a real number, rather than by the
binding/non-binding dichotomy. Thus RMSE is a better measure than AUC. The
results in Table 1 also demonstrate that the regression-based learning model
works well.

Remark 9. Our philosophy is that there is a kernel structure on the set of
amino acid sequences related to their biological functions (e.g. the
corresponding distances on peptides relate to their affinities to each
allele). The kernel should not depend on the alignment information, which is
a source of noise. The performance of our kernel $\hat{K}^3$ is reflected in
the modulus of continuity of the predicted values, namely,

$$\Omega_a := \max_{p, p' \in \mathscr{P}_a,\ p \ne p'} \frac{|\hat{y}_{a,p} - \hat{y}_{a,p'}|}{d(p, p')},$$

where

$$d(p, p') = \|\hat{K}^3_p - \hat{K}^3_{p'}\|_{\hat{K}^3} = \sqrt{2 - 2\hat{K}^3(p, p')}$$

is the distance in the space $\mathscr{H}_{\hat{K}^3}$ on peptides, and the
kernel $\hat{K}^3$ is defined with $\beta = \beta^{*}_{\mathrm{peptide}}$. We
list the values of $\Omega_a$ for the 14 alleles in Table 2.



Allele a            $\Omega_a$

DRB1*0101           1.2222
DRB1*0301           1.0307
DRB1*0401           0.9249
DRB1*0404           0.9726
DRB1*0405           0.8394
DRB1*0701           1.1317
DRB1*0802           0.9368
DRB1*0901           0.8004
DRB1*1101           0.9795
DRB1*1302           0.7745
DRB1*1501           0.9843
DRB3*0101           0.7395
DRB4*0101           0.8587
DRB5*0101           1.0011

Table 2: The modulus of continuity of the predicted values.



The modulus of continuity can be extended to a bigger peptide set
which contains the neighbourhood of each peptide $p \in \mathscr{P}_a$ with
respect to the metric $d$.



3 Kernel Algorithm for pan-Allele Binding Prediction

We now define a pan-allele kernel on the product space of alleles and peptides.
The binding affinity data is thus a subset of this product space. The main
motivation is that with the pan-allele kernel we can predict affinities to alleles
with few or no binding data available. This is often the case because the MHC
II alleles form a huge set (a phenomenon often referred to as MHC II
polymorphism), and determining experimentally the peptide affinities
to all the alleles is an immense job. Also, in the pan-allele setting, one pools the
binding data for different alleles to train the RLS. This makes the
training data set larger than the one available in the fixed-allele setting, and
thus helps to improve the algorithm performance. This is verified in Table 4.

Let $\mathcal{L}$ be a finite set of amino acid sequences representing the MHC
II alleles. Using a positive parameter $\beta_{\mathrm{allele}}$ we define a kernel $\hat{K}^3_{\mathcal{L}}$ on $\mathcal{L}$
following the steps in the Introduction. Let $\mathcal{P}$ be a set of peptides. In
the sequel we denote by $\beta_{\mathrm{peptide}}$ specifically the parameter used to define the
kernel $\hat{K}^3_{\mathcal{P}}$ on $\mathcal{P}$. We define the pan-allele kernel on $\mathcal{L} \times \mathcal{P}$ as

\[ \hat{K}^3_{\mathrm{pan}}((a,p),(a',p')) = \hat{K}^3_{\mathcal{L}}(a,a')\, \hat{K}^3_{\mathcal{P}}(p,p'). \tag{11} \]

Let a set of data $\{(p_i, a_i, r_i)\}_{i=1}^{m}$ be given. Here, for each $i$, $a_i \in \mathcal{L}$, $p_i \in \mathcal{P}$,
and $r_i \in [0, 1]$ is the binding affinity of $p_i$ to $a_i$. The RLS is applied as in
Section 2. The output function $F : \mathcal{L} \times \mathcal{P} \to \mathbb{R}$ is the predicted binding
affinity.
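To make the construction concrete, the sketch below builds the Gram matrix of a product kernel on (allele, peptide) pairs from two precomputed kernel matrices and fits a regularized least squares model. It is an illustration under stated assumptions: the function names are hypothetical, the kernel matrices are toy values, and the ridge term is taken as lam times the sample size, which is a common learning-theory convention and may differ from the normalization used in Section 2.

```python
import numpy as np

def pan_gram(K_allele, K_peptide, allele_idx, peptide_idx):
    """Gram matrix of the product kernel on the given (allele, peptide) pairs."""
    a = np.asarray(allele_idx)
    p = np.asarray(peptide_idx)
    return K_allele[np.ix_(a, a)] * K_peptide[np.ix_(p, p)]

def rls_fit(G, r, lam):
    """Solve (G + lam * m * I) c = r for the RLS coefficients c (assumed scaling)."""
    m = len(r)
    return np.linalg.solve(G + lam * m * np.eye(m), np.asarray(r, dtype=float))

def rls_predict(K_allele, K_peptide, train_a, train_p, c, a_new, p_new):
    """F(a_new, p_new) = sum_i c_i * K_pan((a_i, p_i), (a_new, p_new))."""
    k = K_allele[np.asarray(train_a), a_new] * K_peptide[np.asarray(train_p), p_new]
    return float(k @ c)

# Toy data: 2 alleles, 3 peptides, 4 labelled pairs (made-up numbers).
K_allele = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
K_peptide = np.array([[1.0, 0.2, 0.1],
                      [0.2, 1.0, 0.3],
                      [0.1, 0.3, 1.0]])
train_a = [0, 0, 1, 1]
train_p = [0, 1, 1, 2]
r = np.array([0.8, 0.3, 0.5, 0.1])

G = pan_gram(K_allele, K_peptide, train_a, train_p)
c = rls_fit(G, r, lam=1e-10)
# Prediction for a pair that was never observed: allele 1 with peptide 0.
f_new = rls_predict(K_allele, K_peptide, train_a, train_p, c, a_new=1, p_new=0)
```

Because the kernel factorizes, data for one allele informs predictions for another through the allele kernel, which is the point of the pan-allele setting.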

Remark 10. When we choose $\mathcal{L} = \{a\}$ for a certain allele $a$, the setting
and the algorithm reduce to the fixed-allele version studied in Section 2.

We test the pan-allele kernel with RLS (we call the algorithm "KernelRLSPan")
on Nielsen's NetMHCIIpan-2.0 data set (we also use this name for the
algorithm published in [23] with the data set), which contains
33,931 peptide-allele pairs. For peptides, amino acid sequences are given,
and for alleles, DRB names are given so that we can look up the sequence
representations in $\mathcal{N}$ as defined in Section 1.2. Each pair is labeled with a
[0, 1]-valued binding affinity. In total, 8083 peptides and 24 alleles in $\mathcal{N}$
appear in these peptide-allele pairs. The whole data set is divided
into 5 parts in [23] (see footnote 4).

We choose the following setting. Let $\mathcal{L} = \mathcal{N}$ and let $\mathcal{P}$ be a peptide set
large enough to contain all the peptides in the data set. We use $\beta^*_{\mathrm{peptide}} =
0.11387$ as suggested in (10) to construct $\hat{K}^3_{\mathcal{P}}$, and leave the power index $\beta_{\mathrm{allele}}$
for $\hat{K}^3_{\mathcal{L}}$ to be fixed later. This defines $\hat{K}^3_{\mathrm{pan}}$. We test the RLS algorithm by
five-fold cross validation according to the 5-part division in [23]. In each test
we merge 4 parts of the samples as the training data and leave the other
part as the testing data. Leave-one-out cross validation is further employed
in each test to tune the parameters. We select a pair $(\beta_{\mathrm{allele}}, \lambda)$ from the
product of $\{0.02 \times n : n = 1, 2, \ldots, 8\}$ and $\{e^{n} : n = -17, -16, \ldots, -9\}$.
The procedures are the same as those used in Section 2.3, except that we now do cross
validation over the peptide-allele pairs. In all five tests, the pair $\beta_{\mathrm{allele}} =
0.06$ and $\lambda = e^{-13}$ achieves the best performance on the training data. We
now use the threshold $\theta = \psi_{15,000}(500) \approx 0.3537$ to evaluate the AUC score,
because the affinity values in the data set are obtained by the transform
$\psi_{15,000}$. The results of these computations are shown in Table 3.
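The threshold value and the parameter grid quoted above are easy to reproduce. The sketch below assumes the transform has the form psi_b(x) = 1 - log_b(x), which is consistent with the quoted value 0.3537 for b = 15,000 and x = 500; the function name `psi` and the variable names are illustrative.

```python
import math
from itertools import product

def psi(ic50, base):
    """Assumed log transform: psi_b(x) = 1 - log_b(x)."""
    return 1.0 - math.log(ic50) / math.log(base)

# Binder / non-binder threshold corresponding to IC50 = 500 under base 15,000.
theta = psi(500.0, 15_000.0)

# Candidate values for the allele power index and the regularization parameter.
beta_grid = [0.02 * n for n in range(1, 9)]            # 0.02, 0.04, ..., 0.16
lambda_grid = [math.exp(n) for n in range(-17, -8)]    # e^-17, ..., e^-9
param_grid = list(product(beta_grid, lambda_grid))     # 8 * 9 = 72 pairs
```

The same helper with base 50,000 gives the threshold that applies to data normalized with the other transform mentioned in the text.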

4 Both the data set and the 5-part partition are available at
http://www.cbs.dtu.dk/suppl/immunology/NetMHCIIpan-2.0.






allele, a          #       KernelRLSPan          NetMHCIIpan-2.0
                           RMSE       AUC        AUC
DRB1*0101          7685    0.20575    0.84308    0.846
DRB1*0301          2505    0.18154    0.85095    0.864
DRB1*0302           148    0.21957    0.71176    0.757
DRB1*0401          3116    0.19860    0.84294    0.848
DRB1*0404           577    0.21887    0.80931    0.818
DRB1*0405          1582    0.17459    0.86862    0.858
DRB1*0701          1745    0.17769    0.87664    0.864
DRB1*0802          1520    0.18732    0.78937    0.780
DRB1*0806           118    0.23091    0.89214    0.924
DRB1*0813          1370    0.18132    0.88803    0.885
DRB1*0819           116    0.18823    0.82706    0.808
DRB1*0901          1520    0.19741    0.82220    0.818
DRB1*1101          1794    0.16022    0.88610    0.883
DRB1*1201           117    0.22740    0.87380    0.892
DRB1*1202           117    0.23322    0.89440    0.900
DRB1*1302          1580    0.19953    0.82298    0.825
DRB1*1402           118    0.20715    0.86474    0.860
DRB1*1404            30    0.18705    0.64732    0.737
DRB1*1412           116    0.26671    0.89967    0.894
DRB1*1501          1769    0.19609    0.82858    0.819
DRB3*0101          1501    0.15271    0.82921    0.85
DRB3*0301           160    0.26467    0.86857    0.853
DRB4*0101          1521    0.16355    0.87138    0.837
DRB5*0101          3106    0.18833    0.87720    0.882
Average                    0.20035    0.84109    0.846
Weighted Average           0.19015    0.84887    0.849



Table 3: The performance of KernelRLSPan. For comparison we list the 
AUC scores of NetMHCIIpan-2.0 [23]. The best AUC in each row is marked 
in bold. 

We implement KernelRLSPan on the fixed allele data set used in Table 1.
Recall that the data set is normalized with $\psi_{50,000}$ and has the five-fold
division defined by [25]. The performance is listed in Table 4; it is better
than that of KernelRLS as listed in Table 1.



allele, a     RMSE       AUC        allele, a     RMSE       AUC
DRB1*0101     0.17650    0.86961    DRB1*0301     0.16984    0.85601
DRB1*0401     0.20970    0.82359    DRB1*0404     0.17240    0.88193
DRB1*0405     0.18425    0.84078    DRB1*0701     0.17998    0.90231
DRB1*0802     0.16734    0.88496    DRB1*0901     0.23562    0.71057
DRB1*1101     0.17073    0.91022    DRB1*1302     0.23261    0.75960
DRB1*1501     0.21266    0.80724    DRB3*0101     0.16011    0.79778
DRB4*0101     0.18751    0.84754    DRB5*0101     0.18904    0.89585

Average: RMSE 0.18916, AUC 0.84200

Weighted Average: RMSE 0.18496, AUC 0.85452

Table 4: The performance of KernelRLSPan on the fixed allele data. For
defining AUC, the transform $\psi_{50,000}$ is used as in Table 1.
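The "Average" row of Table 4 is the plain mean over the 14 alleles, while the "Weighted Average" row presumably weights each allele by its number of samples. The short sketch below recomputes the plain means from the per-allele values listed in Table 4; the helper name is illustrative, and the per-allele sample counts needed for the weighted row are not listed in Table 4, so they are left as an optional argument.

```python
def weighted_average(values, weights=None):
    """Weighted mean; with weights=None this is the ordinary mean."""
    if weights is None:
        weights = [1.0] * len(values)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Per-allele scores copied from Table 4, read row by row, left to right.
table4_rmse = [0.17650, 0.16984, 0.20970, 0.17240, 0.18425, 0.17998, 0.16734,
               0.23562, 0.17073, 0.23261, 0.21266, 0.16011, 0.18751, 0.18904]
table4_auc = [0.86961, 0.85601, 0.82359, 0.88193, 0.84078, 0.90231, 0.88496,
              0.71057, 0.91022, 0.75960, 0.80724, 0.79778, 0.84754, 0.89585]

mean_rmse = weighted_average(table4_rmse)   # compare with 0.18916 in Table 4
mean_auc = weighted_average(table4_auc)     # compare with 0.84200 in Table 4
```

Passing the per-allele sample counts as `weights` would give the sample-weighted version of the same scores.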

Next, we use the whole NetMHCIIpan-2.0 data set for training, and test
the algorithm performance on a new data set. A set of 64798 triples of MHC
II-peptide binding data is downloaded from the IEDB (see footnote 5). We pick from the set the
entries that involve DRB alleles, have IC50 scores, and have explicit allele names and peptide
sequences. Items that also appear in the NetMHCIIpan-2.0 data set
are deleted. For duplicated items (same peptide-allele pair and same
affinity) only one copy is kept. All items with the same peptide-allele
pair yet different affinities are deleted. We also delete those with peptide
length less than 9. (KernelRLSPan can handle these peptides, while
NetMHCIIpan-2.0 cannot. The short peptides are therefore deleted to
make the two algorithms comparable.) For some alleles the data in the set
is insufficient to define the AUC score (i.e. the denominator in the definition
of the AUC becomes zero), so we delete the tuples containing them. Eventually we obtain 11334
peptide-allele pairs labelled with IC50 binding affinities, which are further
normalized by $\psi_{15,000}$ as in the NetMHCIIpan-2.0 data set.
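The filtering steps described above can be sketched as a small pipeline. This is an illustration rather than the authors' code: the record layout (allele, peptide, IC50), the function names, the matching of benchmark items by peptide-allele pair, and the convention that a binder has IC50 below 500 are all assumptions made for the example.

```python
from collections import defaultdict

def clean_records(records, benchmark_pairs, min_len=9):
    """Drop short peptides, benchmark overlaps, duplicates and conflicts.

    records         : iterable of (allele, peptide, ic50) triples.
    benchmark_pairs : set of (allele, peptide) pairs already used for training.
    """
    affinities = defaultdict(set)
    for allele, peptide, ic50 in records:
        if len(peptide) < min_len:
            continue                                 # peptide too short
        if (allele, peptide) in benchmark_pairs:
            continue                                 # already in the training set
        affinities[(allele, peptide)].add(ic50)      # exact duplicates collapse
    # Keep only pairs with a single, unambiguous affinity value.
    return sorted((a, p, next(iter(v)))
                  for (a, p), v in affinities.items() if len(v) == 1)

def drop_single_class_alleles(rows, ic50_threshold=500.0):
    """Remove alleles lacking binders or non-binders (AUC undefined for them)."""
    binders, non_binders = defaultdict(int), defaultdict(int)
    for allele, _, ic50 in rows:
        if ic50 < ic50_threshold:
            binders[allele] += 1
        else:
            non_binders[allele] += 1
    return [row for row in rows
            if binders[row[0]] > 0 and non_binders[row[0]] > 0]

# Toy records (made-up sequences and values).
toy_records = [
    ("DRB1*0101", "AAAAAAAAA", 100.0),
    ("DRB1*0101", "AAAAAAAAA", 100.0),    # exact duplicate
    ("DRB1*0101", "CCCCCCCCC", 50.0),
    ("DRB1*0101", "CCCCCCCCC", 900.0),    # conflicting affinity
    ("DRB1*0101", "DDDD", 10.0),          # shorter than 9
    ("DRB1*0101", "EEEEEEEEE", 2000.0),
    ("DRB1*0301", "FFFFFFFFF", 30.0),     # present in the benchmark
    ("DRB1*0401", "GGGGGGGGG", 40.0),     # allele with binders only
]
toy_benchmark = {("DRB1*0301", "FFFFFFFFF")}
cleaned = clean_records(toy_records, toy_benchmark)
final_rows = drop_single_class_alleles(cleaned)
```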

Now define $\hat{K}^3_{\mathrm{pan}}$ on $\mathcal{N} \times \mathcal{P}$ as in (11) with $\beta_{\mathrm{allele}} = 0.06$ as suggested by
the above computation and $\beta_{\mathrm{peptide}} = 0.11387$ as suggested in (10). We train
both KernelRLSPan and NetMHCIIpan-2.0 (see footnote 6) on the NetMHCIIpan-2.0
data set. In KernelRLSPan, leave-one-out cross validation is used to select
$\lambda$ from $\{e^{-18}, \ldots, e^{-8}\}$ (the result shows that $\lambda = e^{-13}$ performs best).
The performance of the two algorithms is compared in Table 5.
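Leave-one-out cross validation for a kernel ridge type estimator does not require refitting the model for every sample: with the ridge term held fixed, the leave-one-out residual equals the ordinary residual divided by one minus the corresponding diagonal entry of the hat matrix. The sketch below uses this standard identity; it is not taken from the paper, and the function names and the scaling of the ridge term (lam times the sample size) are assumptions.

```python
import numpy as np

def loo_residuals(G, y, reg):
    """Leave-one-out residuals of kernel RLS with a fixed ridge term `reg`."""
    m = len(y)
    H = G @ np.linalg.inv(G + reg * np.eye(m))    # hat matrix: fitted = H @ y
    return (y - H @ y) / (1.0 - np.diag(H))

def select_lambda(G, y, exponents=range(-18, -7)):
    """Pick lam = e^n, n = -18, ..., -8, with the smallest leave-one-out RMSE."""
    m = len(y)
    best_lam, best_rmse = None, None
    for n in exponents:
        lam = float(np.exp(n))
        rmse = float(np.sqrt(np.mean(loo_residuals(G, y, lam * m) ** 2)))
        if best_rmse is None or rmse < best_rmse:
            best_lam, best_rmse = lam, rmse
    return best_lam

# Toy Gram matrix from a Gaussian kernel on random points (illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 2))
G_toy = np.exp(-((X_toy[:, None, :] - X_toy[None, :, :]) ** 2).sum(-1))
y_toy_loo = rng.normal(size=6)
fast_loo = loo_residuals(G_toy, y_toy_loo, reg=0.1)
```

The shortcut reproduces the brute-force procedure only when the same ridge term is used for the reduced problems, which is the convention assumed here.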

5 The data set was downloaded from
http://www.immuneepitope.org/list_page.php?list_type=mhc&measured_response=&total_rows=64797&queryType=true
on May 23, 2012.

6 The code is published at http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?netMHCIIpan.
allele, a          #       KernelRLSPan          NetMHCIIpan-2.0
                           RMSE       AUC        RMSE       AUC
DRB1*0101          1024    0.25519    0.79717    0.24726    0.82988
DRB1*0102             7    0.39748    0.58333    0.62935    0.58333
DRB1*0103            41    0.33159    0.83333    0.32204    0.83333
DRB1*0301           883    0.21760    0.80276    0.23975    0.82384
DRB1*0401          1122    0.19610    0.79930    0.19363    0.82456
DRB1*0402            48    0.23912    0.67321    0.27352    0.65714
DRB1*0403            43    0.16381    0.70443    0.15868    0.66995
DRB1*0404           494    0.21689    0.79344    0.20219    0.82517
DRB1*0405           462    0.19617    0.78941    0.19387    0.80611
DRB1*0406            14    0.19516    0.53846    0.19497    0.61538
DRB1*0701           724    0.20853    0.80876    0.20039    0.84786
DRB1*0801            24    0.37281    0.72500    0.34767    0.71250
DRB1*0802           404    0.17403    0.80407    0.17181    0.81085
DRB1*0901           335    0.21204    0.79524    0.21029    0.80489
DRB1*1001            20    0.28082    0.74000    0.24335    0.92000
DRB1*1101           811    0.24195    0.83219    0.23838    0.85071
DRB1*1104            10    0.43717    0.76190    0.57082    0.57143
DRB1*1201           795    0.25786    0.83178    0.24984    0.82685
DRB1*1301           147    0.27014    0.65077    0.30202    0.70722
DRB1*1302           499    0.22194    0.82118    0.21284    0.84258
DRB1*1501           856    0.21580    0.83563    0.20869    0.84902
DRB1*1502             3    0.13186    1.00000    0.20061    1.00000
DRB1*1601            16    0.19556    0.84615    0.18740    0.76923
DRB1*1602            12    0.32238    0.68571    0.30431    0.60000
DRB3*0101           437    0.16568    0.74058    0.17860    0.77182
DRB3*0202           750    0.16021    0.82543    0.16453    0.84191
DRB4*0101           563    0.20594    0.80575    0.21383    0.78734
DRB5*0101           774    0.25934    0.78701    0.25849    0.81950
DRB5*0202            16    0.23013    0.71429    0.40554    0.57143
Average                    0.24046    0.76987    0.25947    0.77151
Weighted Average           0.21853    0.80309    0.21816    0.82216



Table 5: The performance of KernelRLSPan and NetMHCIIpan-2.0 trained 
on the NetMHCIIpan-2.0 benchmark data set, tested on a new dataset down- 
loaded from the IEDB. The best performance of both AUC and RMSE scores 
of each row is marked in bold. 



In this section KernelRLSPan is tested. Tables 3, 4 and 5 suggest that
KernelRLSPan performs much better than KernelRLS. Also, the
kernel method uses only the substitution matrix and the sequence
representations, without direct alignment information, yet yields performance
comparable with the state-of-the-art NetMHCIIpan-2.0 algorithm.

4 Clustering and Supertypes 

In this section, we describe in detail the construction of our cluster tree and 
our classification of DRB alleles into supertypes. We compare the supertypes 
identified by our model with the serotypes designated by WHO and analyze 
the comparison results in detail.

4.1 Identification of DRB Supertypes 

We classify DRB alleles into disjoint subsets by using DRB amino acid se- 
quences and the BLOSUM62 substitution matrix. No peptide binding data 
or X-ray 3D structure data are used in our clustering. We obtain a classifi- 
cation in this way into subsets (a partition) which we call supertypes. 

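In Section 2 we have defined the allele kernel on $\mathcal{N}$ as $\hat{K}^3_{\mathcal{N}}$; the $L^2$
distance derived from $\hat{K}^3_{\mathcal{N}}$ is defined as

\[ D_{L^2}(x,y) = \left( \frac{1}{\#\mathcal{N}} \sum_{z \in \mathcal{N}} \left( \hat{K}^3_{\mathcal{N}}(x,z) - \hat{K}^3_{\mathcal{N}}(y,z) \right)^2 \right)^{1/2}, \quad \forall x, y \in \mathcal{N}. \]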



The OWA-based linkage, defined as follows, is used to measure the proximity
between clusters $X$ and $Y$ (see footnote 7). Let $U = (d_{xy})_{x \in X, y \in Y}$, where $d_{xy} = D_{L^2}(x, y)$.
After ordering (with repetitions) the elements of $U$ in descending order, we
obtain an ordered vector $V = (d'_1, \ldots, d'_n)$, $n = |U|$. A weighting vector
$W = (w_1, \ldots, w_n)$ is associated with $V$, and the proximity between clusters
$X$ and $Y$ is defined as

\[ D_{OWA}(X,Y) = \sum_{i=1}^{n} w_i d'_i. \]

Here the weights $W$ are defined as follows [28]:

\[ w_i = \frac{e^{i/\mu}}{\sum_{j=1}^{n} e^{j/\mu}}, \quad i = 1, 2, \ldots, n, \]

where $\mu = \gamma(1 + n)$ and $\gamma$ is chosen appropriately as 0.1. This weighting
gives more importance to pairs $(x, y)$ which have smaller distance.

Hierarchical clustering [6] is applied to build a cluster tree. A cluster tree
is a tree on which every node represents the cluster of the set of all leaves
descending from that node. The $L^2$ distance $D_{L^2}$ is used to measure the
distance between alleles $x$ and $y$, $x, y \in \mathcal{N}$, and the OWA-based linkage is used
to measure the proximity between clusters $X$ and $Y$, $X, Y \subseteq \mathcal{N}$, instead of
"single" linkage. This algorithm is a bottom-up approach. At the beginning,
each allele is treated as a singleton cluster, and then one successively merges
the two nearest clusters $X$ and $Y$ into a union cluster, the process stopping when
all clusters have been merged into a single cluster.
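The bottom-up procedure with an OWA-type linkage can be sketched in a few lines once a pairwise distance matrix is available. The code below is a naive illustration, not the authors' implementation: it assumes exponential weights proportional to e^(i/mu) with mu = gamma * (1 + n) applied to the distances sorted in descending order, so that smaller distances receive larger weights, and it stops at a requested number of clusters instead of building the full tree. All function names are hypothetical.

```python
import numpy as np

def owa_weights(n, gamma=0.1):
    """Normalized weights w_i proportional to exp(i / mu), mu = gamma * (1 + n)."""
    mu = gamma * (1 + n)
    w = np.exp(np.arange(1, n + 1) / mu)
    return w / w.sum()

def owa_linkage(D, X, Y, gamma=0.1):
    """OWA proximity between clusters X and Y (lists of row indices of D)."""
    d = np.sort(D[np.ix_(X, Y)].ravel())[::-1]     # distances, descending
    return float(owa_weights(d.size, gamma) @ d)

def agglomerate(D, n_clusters, gamma=0.1):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                val = owa_linkage(D, clusters[i], clusters[j], gamma)
                if best is None or val < best[0]:
                    best = (val, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def diameter(D, Z):
    """Largest pairwise distance inside cluster Z."""
    return float(D[np.ix_(Z, Z)].max())

# Toy example: five points on a line forming two obvious groups.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D_toy = np.abs(pts[:, None] - pts[None, :])
toy_clusters = agglomerate(D_toy, n_clusters=2)
```

This naive loop recomputes every linkage at each step, which is adequate for a few hundred alleles but not meant to be efficient.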

This cluster tree, associated to $\mathcal{N}$, thus has 559 leaves. We cut the
cluster tree at 16 clusters, an appropriate level to separate different families
of alleles. The upper part of this tree is shown in Figure 1. The contents of
the clusters are given in Table 6. We assign supertypes to certain clusters in
the cluster tree based on the contents of the clusters described in Table 6. A
supertype is based on one or two clusters in Table 6. If two clusters in Table
6 are closest in the tree, and their alleles are in the same family, they
are assigned an identical supertype. Thirteen supertypes are defined in this
way, which we name ST1, ST2, ST3, ST4, ST5, ST6, ST7, ST8, ST9, ST10,
ST51, ST52 and ST53. The corresponding cluster diameters are 0.11, 0.13,
0.15, 0.14, 0.11, 0.18, 0.08, 0.14, 0.08, 0.02, 0.09, 0.13 and 0.05, respectively.
The diameter of a cluster $Z$ is defined as

\[ \mathrm{diameter}(Z) = \max_{x, y \in Z} D_{L^2}(x, y). \tag{12} \]

The DRB alleles in the first ten supertypes are gathered from the DRB1
locus. The DRB alleles in the ST51, ST52 and ST53 supertypes are gathered
from the DRB5, DRB3 and DRB4 loci, respectively.

7 Another way of measuring distance between clusters is the Hausdorff distance.

4.2 Serotype designation of HLA-DRB alleles 

There is a historically developed classification, based on extensive works 
of medical labs and organizations, that groups alleles into what are called 
serotypes. This classification is oriented to immunology and diseases associ- 
ated to gene variation in humans. It uses peptide binding data, 3D structure, 
X-ray diffraction and other tools. When the confidence level is sufficiently 
high, WHO assigns a serotype to an allele as in Table 6 where a number 
prefixed by DR follows the name of that allele. 

There are four DRB genes (DRB1/DRB3/DRB4/DRB5) in the HLA-DRB
region [16]. The DRB1 gene/locus is much more polymorphic than
the DRB3/DRB4/DRB5 genes/loci [3]. More than 800 allelic variants are 
derived from the exon 2 of the DRB genes in humans [9] . The WHO Nomen- 
clature Committee for Factors of the HLA System assigns an official name for 
each identified allele sequence, e.g. DRB1*01:01. The characters before the 
separator "*" describe the name of the gene, the first two digits correspond 
to the allele family and the third and fourth digits correspond to a specific 
HLA protein. See Table 6 for examples of how the alleles are named. If two 
HLA alleles belong to the same family, they often correspond to the same 
serological antigen, and thus the first two digits are meant to suggest sero- 
logical types. So for those alleles which are not assigned serotypes by WHO, 
WHO has suggested serotypes for them according to their official names or 
allele families. 
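The naming convention described above lends itself to a simple parser. The sketch below is illustrative only: it accepts both the colon-separated form (e.g. DRB1*01:01) and the compact form used in the tables of this paper (e.g. DRB1*0101), tolerates the stray spaces that appear in some listings, and assumes the protein field has two or three digits; the function name and the returned dictionary layout are hypothetical.

```python
import re

# Gene name, two-digit allele family, then a two- or three-digit protein field.
_ALLELE_RE = re.compile(r"^(DRB\d)\*(\d{2}):?(\d{2,3})$")

def parse_allele(name):
    """Split an HLA-DRB allele name into gene, family and protein fields."""
    match = _ALLELE_RE.match(name.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    gene, family, protein = match.groups()
    return {"gene": gene, "family": family, "protein": protein}

def same_family(name_a, name_b):
    """True when two alleles share both gene and allele family."""
    a, b = parse_allele(name_a), parse_allele(name_b)
    return (a["gene"], a["family"]) == (b["gene"], b["family"])
```

For instance, `parse_allele("DRB1*01:01")` and `parse_allele("DRB1*0101")` return the same fields, and `same_family("DRB1*0301", "DRB1*0302")` is true.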

4.3 Comparison of identified supertypes to designated 
serotypes 

In Section 4.1, we have identified thirteen supertypes and in Section 4.2 we 
have introduced the WHO assigned serotypes. In the following, we compare 
these two classifications. 

By using the cluster tree given in Figure 1 and the contents of the clusters 
described in Table 6, we have named our supertypes with prefix "ST" par- 
alleled to the serotype names. The detailed information of DRB alleles and 
serological types for these 13 supertypes is given in Table 6. Our supertype 
clustering recovers the WHO serotype classification and provides further in- 
sight into the classification of DRB alleles which are not assigned serotypes. 
There are 559 DRB alleles in Table 6, and only 138 DRB alleles have WHO 
assigned serotypes. Table 7 gives the relationship between the broad sero- 
logical types and the split serological types. As shown in Tables 6 and 7, 
our supertypes assigned to these 138 DRB alleles are in exact agreement
with the WHO assigned broad serological types (see Table 7). Extensive
medical/biological information was used by WHO to assign serological types,
whereas solely DRB amino acid sequences were used in our supertype
clustering. All alleles with WHO assigned DR52, DR3, DR6, DR8, DR4, DR2,
DR5, DR53, DR9, DR7, DR51, DR10 and DR1 serotypes are classified,
respectively, into the ST52, ST3, ST6, ST8, ST4, ST2, ST5, ST53, ST9, ST7,
ST51, ST10 and ST1 supertypes. The other 461 alleles in Table 6 are not
assigned serotypes by WHO in [14]. However, WHO has suggested serotypes
for them according to their official names or allele families; that is, if two 
DRB alleles are in the same family, they belong to the same serotype. Our 
clustering confirms that this suggestion is reasonable, as can be checked from 
the clusters in Table 6. 

We make some remarks on Figure 1 and Table 6 as follows. 

ST52: This supertype consists of exactly the DRB3 alleles with the ex- 
ception of DRB1*0338 (a new allele and unassigned by WHO [13]). 

ST3: This supertype consists of cluster 2 and cluster 3 in the cluster 
tree and contains 63 DRB1*03 alleles with two exceptions: DRB3*0115 and 
DRB1*1525. The DRB3*0115 is grouped with the DRB1*03 alleles in a
number of different experiments done by us, and the DRB1*1525 is a new
allele and unassigned by WHO. Here, the DR3-serotype is a broad serotype
which consists of three split serotypes, DR3, DR17 and DR18 (see Table 7). 

ST6: This supertype consists of cluster 4 and cluster 5 and contains exactly
102 DRB1*14 alleles. Here, the DR6-serotype is a broad serotype which
consists of five split serotypes, DR6, DR13, DR14, DR1403 and DR1404. 

ST8: This supertype consists of cluster 6 and cluster 7 and mainly con- 
tains 46 DRB1*08 alleles (The serological designation of DRB1*1415 is DR8 
by WHO.). The unassigned alleles DRB1*1425, DRB1*1440, DRB1*1442, 
DRB1*1469, DRB1*1477 and DRB1*1484 are DRB1*14 alleles, but they are 
classified into the ST8 supertype. Both DRB1*14116 and DRB1*14102 are
new allele sequences that do not exist in the tables of [14, 22], and they are
also classified into the ST8 supertype.

Supertypes 52, 4, 2, 5, 53, 9, 7, 51, 10 and 1 correspond, respectively, to 
clusters 1, 8, 9, 10, 11, 12, 13, 14, 15 and 16 in the cluster tree. 

ST4: This supertype consists of exactly 99 DRB1*04 alleles. 

ST2: This supertype consists of 53 DRB1*15 alleles and 16 DRB1*16 
alleles. Here, the DR2-serotype is a broad serotype which consists of three 
split serotypes, DR2, DR15 and DR16. 

ST5: This supertype contains exactly 29 DRB1*12 alleles. The DRB1*0832
is undefined by experts in [14], but its serological designation by the neural
network algorithm [21] is DR8 or DR12. We classify it into the ST5
supertype. The DR5-serotype is a broad serotype which consists of two split
serotypes, DR11 and DR12.

ST53: This supertype consists of exactly the DRB4 alleles. 

ST9: This supertype contains exactly the DRB1*09 alleles with the
exception of DRB5*0112. The DRB5*0112 is undefined by experts in [14], and
in a number of different experiments done by us, DRB5*0112 is clustered
with the DRB1*09 family of alleles.

ST7: This supertype consists of exactly 19 DRB1*07 alleles. 

ST51: This supertype consists of exactly 15 DRB5 alleles. 






ST10: This supertype is the smallest supertype and consists of exactly 3 
DRB1*10 alleles. 

ST1: This supertype consists of exactly 36 DRB1*01 alleles. Here, the
DR1-serotype is a broad serotype which consists of two split serotypes, DR1
and DR103.

For the DRB alleles, there are thirteen broad serotypes given by WHO, 
and our clustering classifies all alleles which are assigned the same broad 
serotype to the same supertype. And for the alleles which are not assigned 
serotypes, our supertypes confirm the nomenclature of WHO. 

As can be seen from Figure 1, the ST52 supertype is closest to the ST3 
supertype. The ST53 supertype is closest to the ST9 and ST7 supertypes. 
The ST51 supertype is closest to the ST10 and ST1 supertypes. 

4.4 Previous work in perspective 

In 1999, Sette and Sidney asserted that all HLA I alleles can be classified 
into nine supertypes [3U |37]. This classification is defined based on the 
structural motifs derived from experimentally determined binding data. The 
alleles in the same supertype comprise the same peptide binding motifs and 
bind to largely overlapping sets of peptides. Essentially, the supertype clas- 
sification problem is to identify peptides that can bind to a group of HLA 
molecules. Besides many works on HLA class I supertype classification, some 
works have been proposed to identify supertypes for HLA class II. In 1998, 
through analyzing a large set of biochemical synthetic peptides and a panel of 
HLA-DR binding assays, Southwood et al. [3H] asserted that seven common 
HLA-DR alleles, e.g. DRB1*0101, DRB1*0401, DRB1*0701, DRB1*0901, 
DRB1*1302, DRB1*1501 and DRB5*0101 had similar peptide binding speci- 
ficity and should be grouped into one supertype. By the use of HLA ligands, 
Lund et al. [19] clustered 50 DRB alleles into nine supertypes by a Gibbs 
Sampling algorithm. Both of these studies used peptide binding data and 
this resulted in the limited number of DRB alleles available for classification. 
Doytchinova and Flower [7] classified 347 DRB alleles into 5
supertypes by the use of both protein sequences and 3D structural data. Ou
et al. [26] defined seven supertypes based on similarity of function rather
than on sequence or structure. To our knowledge, our study is the first to
identify HLA-DR supertypes solely based on DRB amino acid sequence data.
DRB1*0105   DRB1*0114   DRB1*0134   DRB1*0125   DRB1*0116

Table 6: Overview of clusters of HLA-DR alleles with split serological
types assigned by WHO.

The split serological types are obtained from [13]. The left column indicates
the supertypes defined by the cluster tree. Remarks on the labels for the
alleles: "(U.)" stands for "undefined" as marked by the experts in [14]; "(s.s.)"
indicates that the normal form of the allele is shorter than 81 amino acids;
"(n)" with n = 2, 3, ... indicates that the normal form is shared by n alleles.
HLA-DRB1 serological families

Broad Serotype   Split serotype   Alleles
DR1              DR1              DRB1*01
                 DR103            DRB1*0103
DR2              DR2              DRB1*1508, *1603
                 DR15             DRB1*15
                 DR16             DRB1*16
DR3              DR3              DRB1*0305, *0306, *0307, *0312, *0314, *0315, *0323
                 DR17             DRB1*0301, *0304, *0310, *0311
                 DR18             DRB1*0302, *0303
DR4              DR4              DRB1*04
DR5              DR11             DRB1*11
                 DR12             DRB1*12
DR6              DR6              DRB1*1416, *1417, *1418
                 DR13             DRB1*13, *1453
                 DR14             DRB1*14, *1354
                 DR1403           DRB1*1403
                 DR1404           DRB1*1404
DR7              DR7              DRB1*07
DR8              DR8              DRB1*08, *1415
DR9              DR9              DRB1*09
DR10             DR10             DRB1*10

DRB3/4/5 serological families

Serotype         Alleles
DR51             DRB5*01,02
DR52             DRB3*01,02,03
DR53             DRB4*01



Table 7: Overview of the broad serological types in connection with the split 
serological types assigned by WHO. The serological type information listed 
in this table was extracted from the Tables 4 and 5 given in [14]. This table 
summarizes the allele and serotype information given in the first and third 
columns of Tables 4 and 5. 



We are far from claiming to have any definitive answers or final state- 
ments on these questions of peptide binding and serotype clustering. Many 
problems here are left unresolved. For example, the serotype clustering re- 
sult is more provocative than otherwise and further studies are needed. One 
could look at more automatic choice of the supertypes, or develop compara- 
tive schemes. One could also study problems of phylogenetic trees from this 
point of view as those of H5N1. Extending the framework to 3D structures of 
proteins, instead of just amino acid chains is suggested. We intend to study 
these questions ourselves and hope that our study will persuade others to 
think about these kernels on amino acid chains. 
Appendix

A The BLOSUM62-2 Matrix 

We list the whole BLOSUM62-2 matrix in Table 8. Table 9 explains the
amino acids denoted by the capital letters.
          A        R        N        D        C        Q        E        G        H        I
A    3.9029   0.6127   0.5883   0.5446   0.8680   0.7568   0.7413   1.0569   0.5694   0.6325
R    0.6127   6.6656   0.8586   0.5732   0.3089   1.4058   0.9608   0.4500   0.9170   0.3548
N    0.5883   0.8586   7.0941   1.5539   0.3978   1.0006   0.9113   0.8637   1.2220   0.3279
D    0.5446   0.5732   1.5539   7.3979   0.3015   0.8971   1.6878   0.6343   0.6786   0.3390
C    0.8680   0.3089   0.3978   0.3015  19.5766   0.3658   0.2859   0.4204   0.3550   0.6535
Q    0.7568   1.4058   1.0006   0.8971   0.3658   6.2444   1.9017   0.5386   1.1680   0.3829
E    0.7413   0.9608   0.9113   1.6878   0.2859   1.9017   5.4695   0.4813   0.9600   0.3305
G    1.0569   0.4500   0.8637   0.6343   0.4204   0.5386   0.4813   6.8763   0.4930   0.2750
H    0.5694   0.9170   1.2220   0.6786   0.3550   1.1680   0.9600   0.4930  13.5060   0.3263
I    0.6325   0.3548   0.3279   0.3390   0.6535   0.3829   0.3305   0.2750   0.3263   3.9979
L    0.6019   0.4739   0.3100   0.2866   0.6423   0.4773   0.3729   0.2845   0.3807   1.6944
K    0.7754   2.0768   0.9398   0.7841   0.3491   1.5543   1.3083   0.5889   0.7789   0.3964
M    0.7232   0.6226   0.4745   0.3465   0.6114   0.8643   0.5003   0.3955   0.5841   1.4777
F    0.4649   0.3807   0.3543   0.2990   0.4390   0.3340   0.3307   0.3406   0.6520   0.9458
P    0.7541   0.4815   0.4999   0.5987   0.3796   0.6413   0.6792   0.4774   0.4729   0.3847
S    1.4721   0.7672   1.2315   0.9135   0.7384   0.9656   0.9504   0.9036   0.7367   0.4432
T    0.9844   0.6778   0.9842   0.6948   0.7406   0.7913   0.7414   0.5793   0.5575   0.7798
W    0.4165   0.3951   0.2778   0.2321   0.4500   0.5094   0.3743   0.4217   0.4441   0.4089
Y    0.5426   0.5560   0.4860   0.3457   0.4342   0.6111   0.4965   0.3487   1.7979   0.6304
V    0.9365   0.4201   0.3690   0.3365   0.7558   0.4668   0.4289   0.3370   0.3394   2.4175

          L        K        M        F        P        S        T        W        Y        V
A    0.6019   0.7754   0.7232   0.4649   0.7541   1.4721   0.9844   0.4165   0.5426   0.9365
R    0.4739   2.0768   0.6226   0.3807   0.4815   0.7672   0.6778   0.3951   0.5560   0.4201
N    0.3100   0.9398   0.4745   0.3543   0.4999   1.2315   0.9842   0.2778   0.4860   0.3690
D    0.2866   0.7841   0.3465   0.2990   0.5987   0.9135   0.6948   0.2321   0.3457   0.3365
C    0.6423   0.3491   0.6114   0.4390   0.3796   0.7384   0.7406   0.4500   0.4342   0.7558
Q    0.4773   1.5543   0.8643   0.3340   0.6413   0.9656   0.7913   0.5094   0.6111   0.4668
E    0.3729   1.3083   0.5003   0.3307   0.6792   0.9504   0.7414   0.3743   0.4965   0.4289
G    0.2845   0.5889   0.3955   0.3406   0.4774   0.9036   0.5793   0.4217   0.3487   0.3370
H    0.3807   0.7789   0.5841   0.6520   0.4729   0.7367   0.5575   0.4441   1.7979   0.3394
I    1.6944   0.3964   1.4777   0.9458   0.3847   0.4432   0.7798   0.4089   0.6304   2.4175
L    3.7966   0.4283   1.9943   1.1546   0.3711   0.4289   0.6603   0.5680   0.6921   1.3142
K    0.4283   4.7643   0.6253   0.3440   0.7038   0.9319   0.7929   0.3589   0.5322   0.4565
M    1.9943   0.6253   6.4815   1.0044   0.4239   0.5986   0.7938   0.6103   0.7084   1.2689
F    1.1546   0.3440   1.0044   8.1288   0.2874   0.4400   0.4817   1.3744   2.7694   0.7451
P    0.3711   0.7038   0.4239   0.2874  12.8375   0.7555   0.6889   0.2818   0.3635   0.4431
S    0.4289   0.9319   0.5986   0.4400   0.7555   3.8428   1.6139   0.3853   0.5575   0.5652
T    0.6603   0.7929   0.7938   0.4817   0.6889   1.6139   4.8321   0.4309   0.5732   0.9809
W    0.5680   0.3589   0.6103   1.3744   0.2818   0.3853   0.4309  38.1078   2.1098   0.3745
Y    0.6921   0.5322   0.7084   2.7694   0.3635   0.5575   0.5732   2.1098   9.8322   0.6580
V    1.3142   0.4565   1.2689   0.7451   0.4431   0.5652   0.9809   0.3745   0.6580   3.6922



Table 8: The BLOSUM62-2 matrix. 
A   Alanine          L   Leucine
R   Arginine         K   Lysine
N   Asparagine       M   Methionine
D   Aspartic acid    F   Phenylalanine
C   Cysteine         P   Proline
Q   Glutamine        S   Serine
E   Glutamic acid    T   Threonine
G   Glycine          W   Tryptophan
H   Histidine        Y   Tyrosine
I   Isoleucine       V   Valine



Table 9: The list of the amino acids. 



From the Introduction, we see that the matrix $Q$ can be recovered from
the BLOSUM62-2 matrix once the marginal probability vector $p$ is available. The
latter vector is obtained by

\[ p = ([\mathrm{BLOSUM62\text{-}2}])^{-1} v_1, \]

where $v_1 = (1, \ldots, 1) \in \mathbb{R}^{20}$ is the vector with all coordinates equal to 1. The
matrix $Q$ can be obtained precisely from
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/composition_adjustment/matrix_frequency_data.c#L391.
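The recovery of $p$ and $Q$ can be checked numerically. The sketch below assumes that the BLOSUM62-2 entries are the ratios Q(x,y) / (p(x) p(y)), with $Q$ a symmetric joint distribution whose marginal is $p$; under that assumption the matrix times $p$ equals the all-ones vector, which is why a single linear solve recovers $p$. The function name and the 3-letter toy alphabet are illustrative.

```python
import numpy as np

def recover_p_and_Q(B):
    """Recover the marginal p and the joint matrix Q from the ratio matrix B.

    Assumes B[x, y] = Q[x, y] / (p[x] * p[y]) with Q symmetric and row sums p.
    """
    B = np.asarray(B, dtype=float)
    p = np.linalg.solve(B, np.ones(B.shape[0]))   # p = B^{-1} v_1
    Q = B * np.outer(p, p)                        # Q[x, y] = B[x, y] p[x] p[y]
    return p, Q

# Toy joint distribution on a 3-letter alphabet (made-up numbers).
Q_true = np.array([[0.20, 0.05, 0.05],
                   [0.05, 0.30, 0.05],
                   [0.05, 0.05, 0.20]])
p_true = Q_true.sum(axis=1)                       # marginal: (0.3, 0.4, 0.3)
B_toy = Q_true / np.outer(p_true, p_true)
p_rec, Q_rec = recover_p_and_Q(B_toy)
```

Applied to the 20 by 20 matrix of Table 8, the same call would return the marginal probabilities of the amino acids, up to the rounding of the tabulated entries.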

References 

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the Amer- 
ican Mathematical Society, 68:337-404, 1950. 

[2] A. Baas, X.J. Gao, and G. Chelvanayagam. Peptide binding motifs and 
specificities for HLA-DQ molecules. Immunogenetics, 50:8-15, 1999. 

[3] E.E. Bittar and N. Bittar, editors. Principles of Medical Biology: Molec- 
ular and Cellular Pharmacology. JAI Press Inc., 1997. 

[4] F.A. Castelli, C. Buhot, A. Sanson, H. Zarour, S. Pouvelle-Moratille, 
C. Nonn, H. Gahery-Segard, J.-G. Guillet, A. Menez, B. Georges, and 
B. Maillere. HLA-DP4, the most frequent HLA II molecule, defines a 
new supertype of peptide-binding specificity. J. Immunol, 169:6928- 
6934, 2002. 

[5] F. Cucker and D.X. Zhou. Learning Theory: An Approximation Theory 
Viewpoint. Cambridge University Press, 2007. 

[6] W.H.E. Day and H. Edelsbrunner. Efficient algorithms for agglomerative
hierarchical clustering methods. Journal of Classification, 1(1):7-24,
1984.

[7] I.A. Doytchinova and D.R. Flower. In silico identification of supertypes
for class II MHCs. J. Immunol., 174(11):7085-7095, 2005.

[8] Y. El-Manzalawy, D. Dobbs, and V. Honavar. On evaluating MHC-II 
binding peptide prediction methods. PLoS One, 3:e3268, 2008. 






[9] M. Galan, E. Guivier, G. Caraux, N. Charbonnel, and J.-F. Cosson. 
A 454 multiplex sequencing method for rapid and reliable genotyping 
of highly polymorphic genes in large-scale studies. BMC Genomics, 
11(296), 2010. 

[10] Dan Graur and Wen-Hsiung Li. Fundamentals of molecular evolution. 
Sunderland, Mass.: Sinauer Associates, 2000. 

[11] W.W. Grody, R.M. Nakamura, F.L. Kiechle, and C. Strom. Molecular 
Diagnostics: Techniques and Applications for the Clinical Laboratory. 
Academic Press, 2010. 

[12] D. Haussler. Convolution kernels on discrete structures. Technical re- 
port, 1999. 

[13] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices 
from protein blocks. Proceedings of the National Academy of Sciences, 
89:10915-10919, 1992. 

[14] R. Holdsworth, C.K. Hurley, S.G. Marsh, M. Lau, H.J. Noreen, J.H. 
Kempenich, M. Setterholm, and M. Maiers. The HLA dictionary 2008: 
a summary of HLA-A, -B, -C, -DRB1/3/4/5, and -DQB1 alleles and 
their association with serologically defined HLA-A, -B, -C, -DR, and 
-DQ antigens. Tissue Antigens, 73(2):95-170, 2009. 

[15] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge 
University Press, 1994. 

[16] C.A. Janeway, P. Travers, M. Walport, and M.J. Shlomchik. Immuno- 
biology (5th Edition). Garland Science, 2001. 

[17] C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: a string ker- 
nel for SVM protein classification. Pacific Symposium on Biocomputing, 
7:566-575, 2002. 

[18] H. H. Lin, G. L. Zhang, S. Tongchusak, E. L. Reinherz, and V. Brusic. 
Evaluation of MHC-II peptide binding prediction servers: applications 
for vaccine research. BMC Bioinformatics, 9 (Suppl 12):S22, 2008. 

[19] O. Lund, M. Nielsen, C. Kesmir, A.G. Petersen, C. Lundegaard, 
P. Worning, C. Sylvester-Hvid, K. Lamberth, G. Røder, S. Justesen, 
S. Buus, and S. Brunak. Definition of supertypes for HLA molecules us- 
ing clustering of specificity matrices. Immunogenetics, 55(12):797-810, 
2004. 

[20] O. Lund, M. Nielsen, C. Lundegaard, C. Keşmir, and S. Brunak. 
Immunological Bioinformatics. The MIT Press, 2005. 

[21] M. Maiers, G.M. Schreuder, M. Lau, S.G. Marsh, M. Fernandes-Viña, 
H. Noreen, M. Setterholm, and C. Katovich Hurley. Use of a neural 
network to assign serologic specificities to HLA-A, -B and -DRB1 allelic 
products. Tissue Antigens, 62(1):21-47, 2003. 






[22] S.G.E. Marsh, E.D. Albert, W.F. Bodmer, R.E. Bontrop, B. Dupont, 
H.A. Erlich, M. Fernandez-Viña, D.E. Geraghty, R. Holdsworth, C.K. 
Hurley, M. Lau, K.W. Lee, B. Mach, M. Maiers, W.R. Mayr, C.R. 
Müller, P. Parham, E.W. Petersdorf, T. Sasazuki, J.L. Strominger, 
A. Svejgaard, P.I. Terasaki, J.M. Tiercy, and J. Trowsdale. Nomenclature 
for factors of the HLA system, 2010. Tissue Antigens, 75(4):291-455, 
2010. 

[23] M. Nielsen, S. Justesen, O. Lund, C. Lundegaard, and S. Buus. 
NetMHCIIpan-2.0: Improved pan-specific HLA-DR predictions using a 
novel concurrent alignment and weight optimization training procedure. 
Immunome Research, 6(1):9, 2010. 

[24] M. Nielsen and O. Lund. NN-align. An artificial neural network-based 
alignment algorithm for MHC class II peptide binding prediction. BMC 
Bioinformatics, 10:296, 2009. 

[25] M. Nielsen, C. Lundegaard, T. Blicher, B. Peters, A. Sette, S. Justesen, 
S. Buus, and O. Lund. Quantitative predictions of peptide binding 
to any HLA-DR molecule of known sequence: NetMHCIIpan. PLoS 
Comput. Biol, 4(7):e1000107, 2008. 

[26] D. Ou, L.A. Mitchell, and A.J. Tingle. A new categorization of HLA 
DR alleles on a functional basis. Hum. Immunol, 59(10):665-676, 1998. 

[27] J. Robinson, M.J. Waller, P. Parham, N. de Groot, R. Bontrop, L.J. 
Kennedy, P. Stoehr, and S.G. Marsh. IMGT/HLA and IMGT/MHC: 
Sequence databases for the study of the major histocompatibility com- 
plex. Nucleic Acids Res., 31(1):311-314, 2003. 

[28] R. Sadiq and S. Tesfamariam. Probability density functions based 
weights for ordered weighted averaging (OWA) operators: An exam- 
ple of water quality indices. European Journal of Operational Research, 
182(3):1350-1368, 2007. 

[29] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection 
using string alignment kernels. Bioinformatics, 20(11):1682-1689, 
2004. 

[30] H. Saigo, J. P. Vert, and T. Akutsu. Optimizing amino acid substitution 
matrices with a local alignment kernel. BMC Bioinformatics, 7:246, 
2006. 

[31] J. Salomon and D.R. Flower. Predicting class II MHC-peptide binding: 
a kernel based approach using similarity scores. BMC Bioinformatics, 
7:501, 2006. 

[32] B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, 
2001. 

[33] A. Sette, L. Adorini, S.M. Colon, S. Buus, and H.M. Grey. Capac- 
ity of intact proteins to bind to MHC class II molecules. J Immunol, 
143(4):1265-1267, 1989. 






[34] A. Sette and J. Sidney. Nine major HLA class I supertypes account for 
the vast preponderance of HLA-A and -B polymorphism. Immunogenet- 
ics, 50(3-4):201-212, 1999. 



[35] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 
Cambridge University Press, 2004. 

[36] J. Sidney, H.M. Grey, R.T. Kubo, and A. Sette. Practical, biochemical 
and evolutionary implications of the discovery of HLA class I supermo- 
tifs. Immunol. Today, 17(6):261-266, 1996. 

[37] J. Sidney, B. Peters, N. Frahm, C. Brander, and A. Sette. HLA class 
I supertypes: a revised and updated classification. BMC Immunology, 
9(1), 2008. 

[38] S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, and T. Poggio. Math- 
ematics of the neural response. Foundations of Computational Mathe- 
matics, 10(1):67-91, 2010. 

[39] S. Southwood, J. Sidney, A. Kondo, M.F. del Guercio, E. Appella, 
S. Hoffman, R.T. Kubo, R.W. Chesnut, H.M. Grey, and A. Sette. Sev- 
eral common HLA-DR types share largely overlapping peptide binding 
repertoires. The Journal of Immunology, 160(7):3363-3373, 1998. 

[40] Glenys Thomson, Nishanth Marthandan, Jill A. Hollenbach, Steven J. 
Mack, Henry A. Erlich, Richard M. Single, Matthew J. Waller, Steven 
G. E. Marsh, Paula A. Guidry, David R. Karp, Richard H. Scheuer- 
mann, Susan D. Thompson, David N. Glass, and Wolfgang Helmberg. 
Sequence feature variant type (SFVT) analysis of the HLA genetic association 
in juvenile idiopathic arthritis. In Pacific Symposium on Biocomputing 
2010, pages 359-370, 2010. 

[41] P. Wang, J. Sidney, C. Dow, B. Mothe, A. Sette, and B. Peters. A 
systematic assessment of MHC Class II peptide binding predictions 
and evaluation of a consensus approach. PLoS Computational Biology, 
4:e1000048, 2008. 

[42] R.R. Yager. On ordered weighted averaging aggregation operators in 
multicriteria decisionmaking. IEEE Trans. on Systems, Man and Cybernetics, 
18(1):183-190, 1988. 

[43] J.W. Yewdell and J.R. Bennink. Immunodominance in major histocompatibility 
complex class I-restricted T lymphocyte responses. Annu Rev 
Immunol, 17:51-88, 1999. 






